The data for the current study were sourced from the Bureau of Transportation Statistics (BTS) website, which is overseen by the US Department of Transportation, and can be accessed through this link to the dataset.
The US Department of Transportation (USDOT), the governing body for transportation in the US, manages the usage policies for the publicly available data cited through the link above.
While no specific license is attached to the publicly available dataset, the DOT encourages use of the data under a CC-BY attribution license. The CC-BY license lets others distribute, remix, tweak, and build upon the work, even commercially, as long as they credit the original creation. Further details are available through the link to licensing details.
The process for obtaining the dataset and storing it in an R data file (.rda) is as follows:
The dataset is publicly available on the Bureau of Transportation Statistics website and can be accessed here.
Click on “Aviation” as highlighted in the image below.
Once the variables have been selected for the desired timeline, click the blue “Download” button to save the .csv file to the local hard drive.
The data downloaded through the process above is obtained as a “.csv” file. However, it needs to be saved as an R data file (.rda), which stores the dataframe directly. We can write the data from the .csv file to an .rda file using the “save” function and load it in an R session using the “load” function.
The code for saving the dataset in the .rda file is shown below.
df_19 <- read.csv('data/T_ONTIME_REPORTING.csv') # Assigning the data from the .csv file to a dataframe named "df_19"
save(df_19, file = 'data/airports.rda') # Saving the dataframe as a .rda file
To get started with reading the data, we can use the following R code chunk:
library(tidyverse)
library(emo)
load(here::here("data/airports.rda")) # Loading the data into an R session
Once the data has been read, we need to process it to create the subset of the large dataset required for the final analysis. This requires filtering the relevant parts of the initial dataset. Some of the steps for data processing are as follows:
df_SEA_las <- df_19 %>% filter(ORIGIN %in% c('SEA','LAS')) # Keeping flights with "SEA" or "LAS" as the origin airport.
df_SEA_las <- df_SEA_las %>% filter(DEP_DELAY <= 720 | ARR_DELAY <= 720) # Keeping flights whose departure delay or arrival delay is at most 720 minutes.
The data is in the file airports.rda in the data directory. It contains these variables:
The current dataset contains all the US carrier flight details for December 2019 across all the airports in the country. The reasons for choosing this period are as follows:
December has historically been the busiest part of the year for all airports due to the holiday season. To add a new connection within the US, it is beneficial to choose the airport that provides the maximum number of connections to other parts of the country during times of peak passenger traffic.
As the aviation industry was hit hard by the COVID-19 pandemic during 2020-2021, the study focuses on the pre-pandemic period (2019), when airline traffic and passenger behavior were not altered by lockdowns and restrictions.
The raw dataset downloaded from the Bureau of Transportation Statistics covers all the flights operated by the domestic US airline carriers in December 2019. The dataset contains the carrier ID, the airports, cities and states traveled to, the duration of each flight, and any delays encountered.
Since the filtered dataset spans only one month, the sample may not be representative of the full year, as a multitude of factors can affect airline operations as well as passenger behaviour throughout the year. Hence, the results for a particular month may be heavily biased when compared with a sample from another month’s data.
The reasons for selecting the variables in the current dataset are as follows:
DAY_OF_WEEK : This variable would help us understand whether the airports are especially busy on certain days of the week. We can decide which would be the most ideal day/days in the week where passengers would likely fly to either of the airports (SEA or LAS) as final destinations or for further connections.
FL_DATE : This variable provides the exact date of the flight. It can help us aggregate the number of flight connections on different days of the month. Aggregating the flights through a visualisation such as a barplot may help us understand whether the connections have increased, decreased or remained constant throughout the month.
OP_UNIQUE_CARRIER : This variable will help us identify the airline carrier which is completing a particular trip. We can use this variable to understand which carriers are the most active and travel to major as well as the regional airports of the US from SEA and LAS.
ORIGIN : This variable indicates the point of initiation of a flight trip. In particular, this variable provides us with the takeoff airport.
ORIGIN_CITY_NAME : This variable indicates the city where the takeoff airport is situated. We can use this variable to understand the popular flight connections in the US.
ORIGIN_STATE_ABR : This variable indicates the state where the takeoff airport is situated. We can use this variable to understand the popular flight connections in the US.
DEST : This variable indicates the termination of the flight. In particular, this variable provides us with the landing airport.
DEST_CITY_NAME : This variable indicates the city where the landing airport is situated. We can use this variable to understand the popular flight connections in the US.
DEST_STATE_ABR : This variable indicates the state where the landing airport is situated. We can use this variable to understand the popular flight connections in the US.
DEP_TIME : This variable indicates the time of departure for a flight. We can use it to understand the most common hours of flight departure in a day. This would additionally help us schedule the connections of Qantas to SEA or LAS such that passengers have as many options as possible for onward connections to other states.
DEP_DELAY : This variable indicates the departure delay of flights in minutes. We can compare which of SEA and LAS suffers from higher delays. Delays are to be avoided as they impact the operations of the airline carrier negatively.
ARR_TIME : This variable indicates the time of arrival for a flight. We can use it to understand the most common hours of flight landing in a day. This would additionally help us schedule the connections of Qantas to SEA or LAS such that passengers have as many options as possible for onward connections to other states.
ARR_DELAY : This variable indicates the arrival delay of flights in minutes. We can compare which of SEA and LAS suffers from higher delays. These delays could indicate the traffic congestion that might occur at either airport and that should be avoided.
CANCELLED : This variable indicates whether any of the connections were cancelled at the respective airports. Airports with high cancellations should be avoided.
FLIGHTS : This variable indicates the number of flights inbound and outbound. This will help us understand the level of activity in SEA or LAS.
CARRIER_DELAY : This variable indicates the minutes of delay that occurred as a result of the operations of the airline carrier. Airports reporting high carrier delays would not be favourable for passengers choosing an onward connection to another state.
WEATHER_DELAY : This variable indicates the minutes of delay as a result of weather. This could be an important variable to look into, as December brings inclement weather in the US that could affect airline operations. Hence, the effects of weather at each airport have to be studied to decide between SEA and LAS as the new connection for Qantas.
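The comparisons described for the variables above can be sketched in a few lines of base R. This is only an illustration: the data frame `toy` is a hypothetical stand-in for `df_SEA_las` with invented values, and the helper `summarise_airport` is an assumed name, not part of the original analysis.

```r
# Toy stand-in for df_SEA_las; the values are invented for illustration only.
toy <- data.frame(
  ORIGIN        = c("SEA", "SEA", "LAS", "LAS", "SEA", "LAS"),
  DAY_OF_WEEK   = c(1, 1, 1, 5, 5, 5),
  DEP_DELAY     = c(10, NA, 30, 0, 20, 60),   # NA for the cancelled flight
  CANCELLED     = c(0, 1, 0, 0, 0, 0),
  WEATHER_DELAY = c(NA, NA, 15, NA, NA, 45),
  stringsAsFactors = FALSE
)

# Number of flights per origin airport (rows) and day of week (columns).
flights_by_day <- table(toy$ORIGIN, toy$DAY_OF_WEEK)

# Hypothetical helper: one summary row for the flights of a single airport.
summarise_airport <- function(d) {
  data.frame(
    n_flights           = nrow(d),
    mean_dep_delay      = mean(d$DEP_DELAY, na.rm = TRUE),
    cancellation_rate   = mean(d$CANCELLED),
    total_weather_delay = sum(d$WEATHER_DELAY, na.rm = TRUE)
  )
}

# Split by origin airport, summarise each part, and stack the rows.
airport_summary <- do.call(rbind, lapply(split(toy, toy$ORIGIN), summarise_airport))

print(flights_by_day)
print(airport_summary)
```

Missing delays are skipped with `na.rm = TRUE`, so a cancelled flight does not pull the mean delay towards zero; whether that is the right treatment depends on the question being asked.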
While the data collection has been done in an extensive and granular manner with multitudes of information available, some of the limitations that have been observed are as follows :
The arrival and departure times are not recorded as time stamps. As a result, it is difficult to interpret the data. Moreover, since these times are recorded as per the local airport timezone, it is not possible to readily use this data for mathematical operations such as subtraction to calculate the flight duration.
In order to create any temporal analysis, it is important to convert the timezones into a standard timezone for all the observations. Currently, the timestamps vary based on the regional airport timezone.
Moreover, if the departure time and arrival times are from different days, the subtraction will show an incorrect flight duration.
Some of the delays recorded as Arrival and Departure delays are greater than a full day. This is unlikely to be an actual scenario and may bias the analysis.
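One possible way to work around the time-related limitations above is sketched below in base R. It rests on several assumptions that are not confirmed by the dataset description: FL_DATE is formatted as "YYYY-MM-DD", DEP_TIME and ARR_TIME are local clock times stored as hhmm numbers, and the airport-to-timezone lookup `airport_tz` is supplied by hand (it is not part of the dataset). The helper name `to_utc` and the toy flights are hypothetical.

```r
# Hypothetical lookup of airport code -> time zone name (not part of the dataset).
airport_tz <- c(SEA = "America/Los_Angeles",
                LAS = "America/Los_Angeles",
                JFK = "America/New_York")

# Toy flights with invented values; FL_DATE is assumed to be "YYYY-MM-DD".
toy <- data.frame(
  FL_DATE  = c("2019-12-01", "2019-12-01"),
  ORIGIN   = c("SEA", "LAS"),
  DEST     = c("JFK", "SEA"),
  DEP_TIME = c(2230, 905),   # local hhmm at the origin airport
  ARR_TIME = c(645, 1140),   # local hhmm at the destination airport
  stringsAsFactors = FALSE
)

# Convert a date plus a local hhmm clock time into a UTC timestamp.
to_utc <- function(fl_date, hhmm, tz) {
  clock <- sprintf("%04d", as.integer(hhmm))  # e.g. 905 -> "0905"
  secs <- mapply(function(d, h, z) {
    as.numeric(as.POSIXct(paste(d, h), format = "%Y-%m-%d %H%M", tz = z))
  }, fl_date, clock, tz)
  as.POSIXct(unname(secs), origin = "1970-01-01", tz = "UTC")
}

dep_utc <- to_utc(toy$FL_DATE, toy$DEP_TIME, airport_tz[toy$ORIGIN])
arr_utc <- to_utc(toy$FL_DATE, toy$ARR_TIME, airport_tz[toy$DEST])

# Heuristic: only the departure date is known, so an arrival that appears to
# precede the departure is assumed to fall on the next day. Adding 24 hours is
# safe here because there is no daylight-saving change in December.
overnight <- arr_utc < dep_utc
arr_utc[overnight] <- arr_utc[overnight] + 24 * 60 * 60

toy$DURATION_MIN <- as.numeric(difftime(arr_utc, dep_utc, units = "mins"))
print(toy)
```

Once both times are in UTC, plain subtraction gives the elapsed flight time, including for overnight flights and flights that cross time zones. The next-day rule is a simplification and would need revisiting for flights longer than 24 hours or for months with a daylight-saving transition.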
The dataset here is observational data and, in particular, census data. Some of the limitations that are prevalent in such datasets are as follows: